Analytics and Tech Mining for Engineering Managers by Jan H. Kwakkel Scott W. Cunningham

Author: Jan H. Kwakkel, Scott W. Cunningham
Language: eng
Format: epub
Publisher: Momentum Press
Published: 2018-09-25T16:00:00+00:00


CHAPTER 6

PARSING TREE-STRUCTURED FILES

This chapter has two high-level objectives. The first is to discuss character encoding. Until now, we have discussed text as if there were a single, universal way to represent it on our computers. In a multinational and multilingual world, it is important both to understand the nature of text encoding and to be able to handle various encodings when doing text mining. The second objective is to expand our repertoire of parsing routines. We’ve given widely applicable examples of how to parse row-structured and column-structured files. Now it is time to turn to tree-structured data formats.
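To make the point about encodings concrete, here is a minimal sketch using only Python's built-in string and bytes methods. The sample word and the choice of UTF-8 and Latin-1 are our own assumptions for illustration; your files may use other encodings.

```python
# One string, two encodings. The sample word is an illustrative assumption.
text = "café"

utf8_bytes = text.encode("utf-8")      # the accented letter takes two bytes
latin1_bytes = text.encode("latin-1")  # the accented letter takes one byte

# Same text, different byte lengths.
print(len(text), len(utf8_bytes), len(latin1_bytes))

# Decoding UTF-8 bytes as Latin-1 raises no error; it silently garbles the text.
garbled = utf8_bytes.decode("latin-1")
print(ascii(garbled))  # ascii() keeps the output safe on any terminal

# Decoding Latin-1 bytes as UTF-8 fails on the accented letter.
# errors="replace" substitutes the Unicode replacement character instead.
recovered = latin1_bytes.decode("utf-8", errors="replace")
print(ascii(recovered))
```

The practical lesson is that a wrong guess about the encoding does not always fail loudly, so it pays to state the encoding explicitly whenever you read or write text.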

Many richly annotated forms of media are stored in a tree-structured format, and many of these media are highly relevant for monitoring science and technology. Web pages, such as news sites and wiki pages, provide valuable information. Home pages of firms are also of strong interest. In addition, a lot of science and technology content comes in the form of PDF files. Until recently, the PDF format has not been very accessible for machine reading. This chapter provides tools useful for mining PDF files.

We’ll be discussing the BeautifulSoup, pdfminer, and xmltodict packages in the examples to follow. These are all packages specifically adapted to the needs of reading tree-structured files and formats. Let us first discuss tree-structured files.

Trees have nodes and links. The terminology of trees resembles that of a family: there are parent and child nodes. Two children of the same parent are siblings. Children can also be parents, with their own children, extending the tree to multiple layers. Links between elements are preserved by container elements such as lists or dictionaries. If the children are to be accessed for any special purpose, or if the data is well-structured, a dictionary is often used. Otherwise, a list is more common. More rarely, you may find children whose ordering must be preserved, and more specialized structures, including tables, are then used.
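The following sketch shows one way such a tree might look in Python, using a dictionary where children are looked up by name and a list where they are simply visited in turn. The document contents and the helper name iter_leaves are hypothetical, chosen only for illustration.

```python
# A small tree built from dictionaries and lists (illustrative data only).
document = {
    "metadata": {"language": "eng", "format": "epub"},  # dict: children by name
    "sections": [                                       # list: children in turn
        {"heading": "Encoding", "paragraphs": ["First.", "Second."]},
        {"heading": "Trees", "paragraphs": ["Third."]},
    ],
}


def iter_leaves(node, path=()):
    """Yield (path, value) for every leaf below node, parents before children."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from iter_leaves(child, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from iter_leaves(child, path + (index,))
    else:
        # Anything that is not a container is treated as a leaf.
        yield path, node


leaves = list(iter_leaves(document))
for path, value in leaves:
    print(path, value)

# Siblings share a parent: both headings sit directly under "sections".
headings = [section["heading"] for section in document["sections"]]
```

Each path records the chain of parents leading to a leaf, which is often all you need to locate a value again later.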

In another departure, this chapter addresses parsing the whole text instead of just an abstract. Whole texts are usually stored and structured as a tree. Much like an outline, a text tree contains sections and subsections, each embedded in the whole text. In addition, whole texts often contain metadata or other nonreadable elements that are used in rendering the document. Whole texts also warrant new styles of analysis.
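As a rough illustration of a whole text stored as a tree, the sketch below parses a tiny XML document with Python's standard xml.etree.ElementTree module. This is a stand-in we chose so the example has no dependencies; it is not one of the packages discussed in this chapter. The tag names (article, meta, section, p) and the helper name outline are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# A toy whole text: rendering metadata plus nested sections (made-up tags).
xml_text = """
<article>
  <meta><style>body {font: serif}</style></meta>
  <section title="Introduction">
    <p>Trees have nodes and links.</p>
    <section title="Background">
      <p>Children can also be parents.</p>
    </section>
  </section>
</article>
""".strip()

root = ET.fromstring(xml_text)


def outline(element, depth=0):
    """Return (depth, title) for every nested section, in document order."""
    rows = []
    for child in element:
        if child.tag == "section":
            rows.append((depth, child.get("title")))
            rows.extend(outline(child, depth + 1))
    return rows


sections = outline(root)
for depth, title in sections:
    print("  " * depth + title)

# Collect only the readable paragraphs; the <meta> branch has none.
readable = [p.text for p in root.iter("p")]
```

Walking the tree recovers the outline, while selecting only paragraph nodes leaves the nonreadable rendering information behind.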


